Improving Usability of ASCII Strings #1304
This pull request has been linked to Shortcut Story #20965: Unicode-to-ASCII and ASCII-to-Unicode in TileDB-Py.
* Previously, `TILEDB_STRING_ASCII` data was inconsistently displayed as `bytes`.
* There is a need to coerce to `str` everywhere because (1) previously the resulting dataframe displayed ASCII as `bytes` with Pyarrow disabled but as `str` with Pyarrow enabled, and (2) this fix would remove the need to copy large amounts of data to convert back and forth in the TileDB-SingleCell Python API.
* A warning is now emitted asking the user to pass `dtype="ascii"` for string dim types in lieu of `np.bytes_` or `np.str_`, for clarity. All three still work and under the hood use `np.str_` and `TILEDB_STRING_ASCII`.
* The `repr` of string dimensions is now always displayed as `dtype="ascii"`. Calling `.dtype()` will return `np.dtype('U')`, as the return signature of `dtype` requires a Numpy dtype.
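The behavior described above can be illustrated with a minimal, standard-library-only sketch. This is not the code in this PR: `normalize_string_dim_dtype` and `as_str` are hypothetical names, and the builtins `bytes` and `str` stand in for `np.bytes_` and `np.str_` so the example runs without NumPy or TileDB-Py installed.

```python
import warnings

ASCII = "ascii"


def normalize_string_dim_dtype(dtype):
    """Map every accepted spelling of a string dim dtype to "ascii".

    Hypothetical helper: `bytes` and `str` stand in for np.bytes_ and
    np.str_. All three spellings are accepted, but only "ascii" is silent.
    """
    if dtype == ASCII:
        return ASCII
    if dtype in (bytes, str):
        warnings.warn(
            'pass dtype="ascii" for string dimensions for clarity',
            UserWarning,
            stacklevel=2,
        )
        return ASCII
    raise TypeError(f"unsupported string dimension dtype: {dtype!r}")


def as_str(value):
    """Return ASCII data as str, whether it arrives as bytes or str."""
    if isinstance(value, bytes):
        # ASCII is a subset of UTF-8, so valid ASCII always decodes cleanly.
        return value.decode("ascii")
    return value
```

With a helper like `as_str` applied at the display boundary, a caller would see `str` regardless of which read path produced the value, which is the consistency this PR is after.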
3202aee to dead46e
@nguyenv testing now -- thanks! :)
@nguyenv using SOMA The Also, I see:
This is (as a reminder) in reference to single-cell-data/TileDB-SOMA#99 -- please see there for the reason that we cannot store these attributes as Unicode within the TileDB storage.
We have resolved the original problem that prompted this PR. However, this branch still contains several features that may be important for usability, such as consistent presentation of ASCII as
`TILEDB_STRING_ASCII` Now Displaying As UTF-8 / `str` Everywhere